Duke University
Canonical Regression Model & Bayesian Model Averaging
Estimation via MCMC & Monte Carlo Frequencies
Probability Proportional to Size Sampling in Finite Populations
Adaptive Independent Metropolis/Adaptive Importance Sampling
Observe response vector \(\mathbf{Y}\) with predictor variables \(X_1 \dots X_p\).
Model for data under a specific model \({\cal M}_\gamma\): \[\begin{equation*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta_\gamma}, \phi, {\cal M}_\gamma\sim \textsf{N}(\mathbf{1}_n\alpha + \mathbf{X}_{\boldsymbol{\gamma}}\boldsymbol{\beta_\gamma}, \mathbf{I}_n/\phi) \label{eq:linear.model} \end{equation*}\]
Models \({\cal M}_\gamma\) encoded by \(\boldsymbol{\gamma}= (\gamma_1, \ldots \gamma_p)^T\) binary vector with \(\gamma_j = 1\) indicating that \(X_j\) is included in model \({\cal M}_\gamma\) where \[\begin{align*} \gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\ \gamma_j = 1 & \Leftrightarrow \beta_j \neq 0 \end{align*}\]
\(\mathbf{X}_{\boldsymbol{\gamma}}\): the \(n \times p_{\boldsymbol{\gamma}}\) design matrix for model \({\cal M}_\gamma\)
\(\boldsymbol{\beta_\gamma}\): the \(p_{\boldsymbol{\gamma}}\) vector of non-zero regression coefficients under \({\cal M}_\gamma\)
intercept \(\alpha\), precision \(\phi\) common to all models
assign prior distributions to all unknowns \((\boldsymbol{\beta}_{{\cal M}_\gamma}, {\cal M}_\gamma, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) and turn the Bayesian crank to obtain posterior distributions!
for nice priors, we can integrate out the parameters \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}= (\boldsymbol{\beta_\gamma}, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) to obtain the marginal likelihood of \({\cal M}_\gamma\) \[\begin{align*} p(\mathbf{Y}\mid {\cal M}_\gamma) & = \int p(\mathbf{Y}\mid \boldsymbol{\theta}_{\boldsymbol{\gamma}}, {\cal M}_\gamma)p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\mid {\cal M}_\gamma) d\boldsymbol{\theta}_{\boldsymbol{\gamma}}\\ p({\cal M}_\gamma\mid \mathbf{Y}) & = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}\in \boldsymbol{\Gamma}} p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)} \end{align*}\]
posterior distribution of quantities \(\Delta\) of interest under BMA \[ \sum_{\boldsymbol{\gamma}\in \boldsymbol{\Gamma}} p({\cal M}_\gamma\mid \mathbf{Y}) p(\Delta \mid \mathbf{Y}, {\cal M}_\gamma) \]
estimation \(\textsf{E}[\boldsymbol{\mu}\mid \mathbf{Y}]\), \(p(\mathbf{Y}^* \mid \mathbf{Y})\), marginal inclusion probabilities \(P(\gamma_j = 1 \mid \mathbf{Y})\)
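Posterior quantities under BMA are mixtures of the model-specific quantities, weighted by the posterior model probabilities. A minimal sketch of the point-estimate case (all numbers below are hypothetical):

```python
# Toy sketch of a BMA point estimate; all numbers are hypothetical.
post_prob = [0.6, 0.3, 0.1]        # p(M_gamma | Y) for three models
mu_given_model = [1.2, 0.8, 0.0]   # E[mu | Y, M_gamma] under each model

# BMA posterior mean: sum_gamma p(M_gamma | Y) * E[mu | Y, M_gamma]
mu_bma = sum(w * m for w, m in zip(post_prob, mu_given_model))
```

The same weighted-sum pattern gives predictions \(\textsf{E}[\mathbf{Y}^*\mid\mathbf{Y}]\) and, with indicator summaries, the marginal inclusion probabilities.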
Use a sample of models from \(\boldsymbol{\Gamma}\) to approximate the posterior distribution of models
design a Markov Chain to transition through \(\boldsymbol{\Gamma}\) with stationary distribution \(p({\cal M}_\gamma\mid \mathbf{Y})\) \[ p({\cal M}_\gamma\mid \mathbf{Y}) \propto p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma) \]
propose a new model from \(q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})\)
accept moving to \(\boldsymbol{\gamma}^*\) with probability \[ \textsf{MH} = \min\left(1, \frac{p(\mathbf{Y}\mid {\cal M}_{\gamma^*}) p({\cal M}_{\gamma^*})/q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})} {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)/q(\boldsymbol{\gamma}\mid \boldsymbol{\gamma}^*)}\right) \]
otherwise stay at model \({\cal M}_\gamma\)
models are sampled proportional to their posterior probabilities as \(T \to \infty\)
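A minimal sketch of such a sampler on a toy model space. The unnormalized log posterior over \(\boldsymbol{\gamma}\) is made up for illustration; a symmetric single-flip proposal is used so \(q\) cancels in the MH ratio:

```python
import itertools, math, random

random.seed(42)
p = 3  # number of candidate predictors (toy)

def log_post_unnorm(gamma):
    # stand-in for log p(Y | M_gamma) + log p(M_gamma); hypothetical values
    return -0.5 * abs(sum(gamma) - 2) - 0.1 * gamma[0]

def propose(gamma):
    # flip one coordinate at random (symmetric proposal)
    j = random.randrange(p)
    g = list(gamma)
    g[j] = 1 - g[j]
    return tuple(g)

gamma = (0,) * p
counts = {}
T = 100_000
for _ in range(T):
    gstar = propose(gamma)
    log_r = log_post_unnorm(gstar) - log_post_unnorm(gamma)
    if log_r >= 0 or random.random() < math.exp(log_r):  # accept w.p. min(1, r)
        gamma = gstar
    counts[gamma] = counts.get(gamma, 0) + 1

# visit frequencies approximate the exact posterior (enumerable for small p)
Z = sum(math.exp(log_post_unnorm(g)) for g in itertools.product([0, 1], repeat=p))
exact = {g: math.exp(log_post_unnorm(g)) / Z for g in itertools.product([0, 1], repeat=p)}
est = {g: c / T for g, c in counts.items()}
```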
Estimate the probabilities of models via Monte Carlo frequencies of models or ergodic averages \[\begin{align*} \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} & = \frac{\sum_{t = 1}^T I({{\cal M}}_t = {\cal M}_\gamma)} {T} \\ & = \frac{n_{\boldsymbol{\gamma}}\, I({\cal M}_\gamma\in S)} {\sum_{\boldsymbol{\gamma}' \in S} n_{\boldsymbol{\gamma}'}} \\ \end{align*}\]
\(T\) = # MCMC samples
\(S\) is the collection of unique sampled models
\(n_{\boldsymbol{\gamma}}\) is the frequency of model \({\cal M}_\gamma\) in the MCMC sample
\(n = \sum_{\boldsymbol{\gamma}\in S} n_{\boldsymbol{\gamma}}\) is the total number of sampled models (so \(n = T\))
asymptotically unbiased as \(T \to \infty\)
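The two forms of the estimator above agree by construction; a few lines with toy draws (the model sequence is hypothetical):

```python
from collections import Counter

# Hypothetical MCMC output: the sequence of visited models gamma_t
draws = [(1, 0), (1, 0), (0, 1), (1, 1), (1, 0), (0, 1)]
T = len(draws)

# form 1: ergodic average of the indicators I(M_t = M_gamma)
p_hat_1 = {g: sum(d == g for d in draws) / T for g in set(draws)}

# form 2: frequencies n_gamma of the unique models in S,
# normalized by sum of n_gamma over S (which equals T)
n_gamma = Counter(draws)
n = sum(n_gamma.values())
p_hat_2 = {g: c / n for g, c in n_gamma.items()}

assert p_hat_1 == p_hat_2  # the two forms agree
```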
fundamentally unsound to a Bayesian! (O’Hagan 1987, The Statistician)
ignores observed information in the marginal likelihoods \(\times\) prior probabilities!
Can view MH as a form of Probability Proportional to Size Sampling (PPS) With Replacement
can we do better using ideas from Finite Population Sampling?
Let \(q({{\cal M}}_i)\) be the probability of selecting \({{\cal M}}_i\)
Goal is to estimate \(C= \sum_{i=1}^N p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\)
Hansen-Hurwitz (1943) may be viewed as an importance sampling estimate \[\hat{C} = \frac{1}{n}\sum_{i \in S} \frac{ n_i\, p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{q({{\cal M}}_i)} \]
If we have “perfect” samples from the posterior then \(q({{\cal M}}_i) = \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{C}\) and recover \(C\)!
Since \(C\) is unknown, apply the ratio HH estimator (or self-normalized IS) \[ \hat{C} = \frac{\frac1 n \sum_{i \in S} \frac{n_i\, p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{q({{\cal M}}_i)}}{ \frac1 n \sum_{i \in S} \frac{n_i\, p({{\cal M}}_i)}{q({{\cal M}}_i)}} = \left[ \frac{1}{n} \sum_{i \in S} \frac{n_i}{p(\mathbf{Y}\mid {{\cal M}}_i)} \right]^{-1} \] where the last equality takes \(q({{\cal M}}_i) \propto p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) (perfect posterior samples), so the unknown \(C\) cancels
But this recovers the “infamous” harmonic mean estimator of Newton & Raftery (1994) - while simulation consistent, it is highly unstable!
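A quick numeric sanity check on a toy, enumerable model space (likelihoods and priors are made up): with perfect posterior draws of models, the harmonic mean of the sampled marginal likelihoods targets \(C\), but its variance is driven by the smallest likelihoods:

```python
import random
random.seed(1)

# Hypothetical toy model space: marginal likelihoods p(Y | M_i) and priors
marg_lik = [0.8, 0.5, 0.3, 0.1]
prior = [0.25, 0.25, 0.25, 0.25]
C = sum(m * p for m, p in zip(marg_lik, prior))   # true normalizing constant

# "perfect" draws of model indices from the posterior p(M_i | Y) = m_i p_i / C
post = [m * p / C for m, p in zip(marg_lik, prior)]
n = 100_000
draws = random.choices(range(len(marg_lik)), weights=post, k=n)

# harmonic mean of the sampled marginal likelihoods: consistent for C here,
# but the 1/p(Y|M) weights explode when some likelihoods are tiny
C_hat = n / sum(1.0 / marg_lik[i] for i in draws)
```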
inclusion probability that \(\boldsymbol{\gamma}_i \in S\): under sampling with replacement, \(\pi_i = 1 - (1 - q({{\cal M}}_i))^T\)
HT estimate of normalizing constant: \[\quad \hat{C} = \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}\] (dominates HH; the unique hyper-admissible estimate of \(C\))
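A sketch of the HT estimate with \(\pi_i\) computed from \(T\) with-replacement draws, on a hypothetical enumerable model space (so design-unbiasedness can be checked by averaging over replications):

```python
import random
random.seed(7)

# Hypothetical toy model space: m_i = p(Y | M_i) p(M_i)
m = [0.8, 0.5, 0.4, 0.2, 0.1]
C = sum(m)                          # true normalizing constant (= 2.0)
q = [mi / C for mi in m]            # PPS draw probabilities (the posterior)
T = 10                              # with-replacement draws per sample

# inclusion probability that model i appears at least once in T draws
pi = [1 - (1 - qi) ** T for qi in q]

def ht_estimate():
    S = set(random.choices(range(len(m)), weights=q, k=T))  # unique models
    return sum(m[i] / pi[i] for i in S)

# HT is design-unbiased for C: its average over replications is close to C
reps = 5000
avg = sum(ht_estimate() for _ in range(reps)) / reps
```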
Hájek (1971) estimator uses an auxiliary variable \(A_i > 0\), where we expect \(p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i) \propto A_i\), with \(A \equiv \sum_{i = 1}^{N} A_i\) \[\hat{C} = \frac{\sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}} {\sum_{i \in S}\frac{A_i/A}{\pi_i}}\] may be preferable when \(p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)\) is weakly correlated with \(\pi_i\) or when \(n\) is not fixed
Basu’s (1971) famous circus example illustrated potential problems with the Horvitz-Thompson estimator (similar problem arises with IS)
violates the likelihood principle
once we have samples, \(p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) are fixed and the sampling probabilities are not relevant
the only randomness is in the remaining units that were not sampled (which is governed by the sampling design)
Basu’s estimate (using \(\pi_i = A_i/A\)), \[\hat{C} = \sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \left( \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{\pi_i} \right) \times \left(\sum_{i \notin S} \pi_i \right)\]
conditions on the observed total over sampled models and estimates the remaining total
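A minimal sketch of Basu's estimate; the \(m_i\), inclusion probabilities, and sample \(S\) below are all hypothetical:

```python
# Toy sketch of Basu's estimate; m_i, pi_i, and the sample S are hypothetical.
m = [0.8, 0.5, 0.4, 0.2, 0.1]          # m_i = p(Y | M_i) p(M_i)
pi = [0.99, 0.94, 0.89, 0.65, 0.40]    # assumed inclusion probabilities
S = [0, 1, 2, 3]                       # indices of the sampled (unique) models
n = len(S)
not_S = [i for i in range(len(m)) if i not in S]

# average of the HT-style terms over the sampled models
beta_hat = sum(m[i] / pi[i] for i in S) / n

# observed total, plus predicted mass for the unsampled models
C_hat = sum(m[i] for i in S) + beta_hat * sum(pi[i] for i in not_S)
```

Note the first term is kept fixed (conditioned on), and only the unsampled remainder is estimated.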
Basu’s (1971) estimate of the total can be justified by a “super-population” Model Based approach (Meeden and Ghosh, 1983)
Let \(m_i = p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) \[\begin{align*} m_i \mid \pi_i &\mathrel{\mathop{\sim}\limits^{\rm ind}}\textsf{N}(\pi_i \beta, \sigma^2 \pi_i^2) \\ p(\beta, \sigma^2) & \propto 1/\sigma^2 \end{align*}\]
posterior mean of \(\beta\) is \(\hat{\beta} = \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\) (the HT of the total)
using the posterior predictive for \(i \notin S\), \(\textsf{E}[m_i \mid \{m_j\}_{j \in S}] = \pi_i \hat{\beta}\) \[\begin{align*} C & = \sum_{i \in \boldsymbol{\Gamma}} m_i = \sum_{i \in S} m_i + \sum_{i \notin S} m_i \\ \hat{C} & = \sum_{i \in S} m_i + \sum_{i \notin S} \hat{\beta} \pi_i = \sum_{i \in S} m_i + \left[\frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i} \right] \sum_{i \notin S} \pi_i \end{align*}\]
estimate of posterior probability \({\cal M}_\gamma\) for \({\cal M}_\gamma\in S\) \[ \frac{p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]
estimate of all models in \(\boldsymbol{\Gamma}- S\) from the predictive distribution \[ \frac{\frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]
Uses renormalized marginal likelihoods of sampled models
easy to compute marginal inclusion probabilities
Other mean/variance assumptions for the super-population model lead to other estimates for \(C\), \(p({\cal M}_\gamma\mid \mathbf{Y})\), etc
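Putting the pieces together, a sketch of the renormalized model probabilities plus the predictive mass left for \(\boldsymbol{\Gamma}- S\) (all inputs hypothetical):

```python
# Toy sketch: renormalized posterior model probabilities from a sample S,
# with predictive mass assigned to unsampled models (numbers hypothetical)
m_S = [0.8, 0.5, 0.4, 0.2]       # p(Y | M_i) p(M_i) for the sampled models
pi_S = [0.99, 0.94, 0.89, 0.65]  # their inclusion probabilities
n = len(m_S)

beta_hat = sum(mi / pi for mi, pi in zip(m_S, pi_S)) / n
unsampled_mass = beta_hat * sum(1 - pi for pi in pi_S)
C_hat = sum(m_S) + unsampled_mass

post_prob = [mi / C_hat for mi in m_S]   # for models in S
p_rest = unsampled_mass / C_hat          # total for all models in Gamma - S
```

Marginal inclusion probabilities then follow by summing `post_prob` over the sampled models that include each variable.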
What about \(\textsf{E}[\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{X}\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{Y}^* \mid \mathbf{Y}]\) or \(p(\Delta \mid \mathbf{Y})\)?
The joint posterior distribution of \(\boldsymbol{\gamma}\) (dropping \(\mathbf{Y}\)) may be factored: \[p({\cal M}_\gamma\mid \mathbf{Y}) \equiv p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j = 1}^p p(\gamma_j \mid \boldsymbol{\gamma}_{<j}) \] where \(\boldsymbol{\gamma}_{< j}\equiv \{\gamma_k\}\) for \(k < j\) and \(p(\gamma_1 \mid \boldsymbol{\gamma}_{<1}) \equiv p(\gamma_1)\).
As \(\gamma_j\) are binary, re-express as \[\begin{equation*} p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j \mid <j})}^{1-\gamma_j} \end{equation*}\] where \(\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j})\) and \(\rho_{1 \mid < 1} = \rho_1\), the marginal probability.
Product of Dependent Bernoullis
Factor proposal \[q(\boldsymbol{\gamma}) = \prod_{j = 1}^p q(\gamma_j \mid \boldsymbol{\gamma}_{<j}) = \prod_j \textsf{Ber}(\hat{\rho}_{j \mid <j}) \]
Note: \(\Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j}) = \textsf{E}[\gamma_j \mid \boldsymbol{\gamma}_{<j}]\)
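A sketch of the product-of-dependent-Bernoullis construction, using a hypothetical rule for the conditional probabilities \(\rho_{j \mid <j}\) (in the actual method these come from fitted regressions):

```python
import itertools
import random
random.seed(3)

p = 3

# hypothetical conditional inclusion probabilities rho_{j | <j}: a toy
# rule where including earlier variables raises the odds for later ones
def rho(j, gamma_prev):
    base = [0.7, 0.4, 0.5][j]
    return min(0.95, base + 0.1 * sum(gamma_prev))

def q_of(gamma):
    """Evaluate q(gamma) = prod_j rho^(gamma_j) (1 - rho)^(1 - gamma_j)."""
    q, prev = 1.0, []
    for j, g in enumerate(gamma):
        r = rho(j, prev)
        q *= r if g else 1 - r
        prev.append(g)
    return q

def draw_gamma():
    """Sample gamma one coordinate at a time; return it with q(gamma)."""
    gamma = []
    for j in range(p):
        gamma.append(1 if random.random() < rho(j, gamma) else 0)
    return tuple(gamma), q_of(gamma)

gamma, q_gamma = draw_gamma()

# q is a proper distribution over the 2^p models
total = sum(q_of(g) for g in itertools.product([0, 1], repeat=p))
```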
Fit a sequence of \(p\) regressions of \(\gamma_j\) on \(\boldsymbol{\gamma}_{<j}\) \[\begin{align*} \gamma_1 & = \mu_1 + \epsilon_1 \\ \gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\ \gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 - \mu_1) + \beta_{3 2} (\gamma_2 - \mu_2) + \epsilon_3 \\ & \vdots \\ \gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 - \mu_1) + \ldots + \beta_{p \, p-1} (\gamma_{p-1} - \mu_{p-1})+ \epsilon_p \end{align*}\]
Approximate model \[\boldsymbol{\gamma}\sim \textsf{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\boldsymbol{\gamma}})\]
Wermuth (1980) compositional regression \[ \mathbf{G}= \mathbf{1}_{T} \boldsymbol{\mu}^T + (\boldsymbol{\Gamma}- \mathbf{1}_T \boldsymbol{\mu}^T) \mathbf{B}+ \boldsymbol{\epsilon} \]
\(\mathbf{G}\) is \(T \times p\) matrix where row \(t\) is \(\boldsymbol{\gamma}_t\)
\(\boldsymbol{\mu}\) is the \(p\) dimensional vector of \(\textsf{E}[\boldsymbol{\gamma}]\)
\(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}} = \mathbf{U}^T \mathbf{U}\) where \(\mathbf{U}\) is the upper triangular Cholesky factor of the covariance matrix of \(\boldsymbol{\gamma}\) (\(p \times p\))
\(\mathbf{B}^T = \mathbf{I}_p - \operatorname{diag}(\mathbf{U})\, \mathbf{U}^{-T}\) (lower triangular)
\(\mathbf{B}\) is a \(p \times p\) upper triangular matrix with zeros on the diagonal and the regression coefficients for the \(j\)th regression in column \(j\) (row \(j\) of \(\mathbf{B}^T\))
OLS is BLUE and consistent, but \(\mathbf{G}\) may not be full rank
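A numeric check of the key identity, under the convention \(\mathbf{B} = \mathbf{I}_p - \mathbf{U}^{-1}\operatorname{diag}(\mathbf{U})\): a single Cholesky factorization yields the coefficients of every sequential regression at once (the covariance below is random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# any positive-definite Sigma works; this one is random for illustration
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)

# Sigma = U^T U with U upper triangular (numpy returns the lower factor L)
U = np.linalg.cholesky(Sigma).T

# one Cholesky gives all regression coefficients at once:
# B = I - U^{-1} diag(U) is strictly upper triangular (zero diagonal)
B = np.eye(3) - np.linalg.inv(U) @ np.diag(np.diag(U))

# column j of B matches the usual population regression formula
# b_j = Sigma_{<j,<j}^{-1} Sigma_{<j,j}: check it for the third variable
b3 = np.linalg.solve(Sigma[:2, :2], Sigma[:2, 2])
assert np.allclose(B[:2, 2], b3)
```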
apply Bayesian Shrinkage with “priors” on \(\boldsymbol{\mu}\) (non-informative or Normal) and \(\boldsymbol{\Sigma}\) (inverse-Wishart)
pseudo-posterior mean \(\boldsymbol{\mu}\) is the current estimate of the marginal inclusion probabilities \(\bar{\boldsymbol{\gamma}} = \hat{\boldsymbol{\mu}}\)
use pseudo-posterior mean for \(\boldsymbol{\Sigma}\)
one Cholesky decomposition provides all coefficients for the \(p\) predictions for proposing \(\boldsymbol{\gamma}^*\)
constrain predicted values \(\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)\)
generate \(\boldsymbol{\gamma}^*_j \mid \boldsymbol{\gamma}^*_{< j} \sim \textsf{Ber}(\hat{\rho}_{j \mid <j})\)
use as proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (accept all) -or- Sampling Without Replacement (todo)
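The proposal step can be sketched as follows; the current estimates \(\hat{\boldsymbol{\mu}}\) and coefficient matrix \(\mathbf{B}\) below are hypothetical placeholders for the quantities adapted from past samples:

```python
import numpy as np

rng = np.random.default_rng(1)
p, delta = 4, 0.025

# assumed current estimates (hypothetical values): marginal inclusion
# probabilities mu and strictly upper triangular regression coefficients B
mu = np.array([0.6, 0.3, 0.8, 0.5])
B = np.array([[0.0, 0.2, -0.1, 0.0],
              [0.0, 0.0, 0.3, 0.1],
              [0.0, 0.0, 0.0, -0.2],
              [0.0, 0.0, 0.0, 0.0]])

def propose():
    """Draw gamma* one coordinate at a time; also return q(gamma*)."""
    gamma, q = np.zeros(p), 1.0
    for j in range(p):
        rho = mu[j] + B[:j, j] @ (gamma[:j] - mu[:j])  # predicted rho_{j|<j}
        rho = float(np.clip(rho, delta, 1 - delta))    # keep in (delta, 1-delta)
        gamma[j] = rng.random() < rho
        q *= rho if gamma[j] else 1 - rho
    return gamma, q

gamma_star, q_star = propose()
```

The returned \(q(\boldsymbol{\gamma}^*)\) is what enters the independent MH ratio or the importance weight.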
Tecator data (Griffin et al. (2021))
a sample of \(p = 20\) variables
compare samplers under the same settings for burnin.it, MCMC.it, thin
Want to avoid MCMC for more general models?
avoid infinite regret
Code is in the development version of BAS at https://github.com/merliseclyde/BAS